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Abstract 

The data set simulated for Genetic Analysis Workshop 17 was designed to mimic a subset of data that might be 
produced in a full exome screen for a complex disorder and related risk factors in order to permit workshop 
participants to investigate issues of study design and statistical genetic analysis. Real sequence data from the 1000 
Genomes Project formed the basis for simulating a common disease trait with a prevalence of 30% and three 
related quantitative risk factors in a sample of 697 unrelated individuals and a second sample of 697 individuals in 
large, extended pedigrees. Called genotypes for 24,487 autosomal markers assigned to 3,205 genes and simulated 
affection status, quantitative traits, age, sex, pedigree relationships, and cigarette smoking were provided to 
workshop participants. The simulating model included both common and rare variants with minor allele 
frequencies ranging from 0.07% to 25.8% and a wide range of effect sizes for these variants. Genotype-smoking 
interaction effects were included for variants in one gene. Functional variants were concentrated in genes selected 
from specific biological pathways and were selected on the basis of the predicted deleteriousness of the coding 
change. For each sample, unrelated individuals and family, 200 replicates of the phenotypes were simulated. 



Background 

The state of the science for localization and identifica- 
tion of genes that influence common complex diseases 
has changed rapidly over the past 20 years. As labora- 
tory costs continue to fall with the development of 
more efficient high-throughput techniques, the field is 
quickly proceeding toward studies that make use of gen- 
ome-wide sequence data. There is as yet no consensus 
on optimal, or even appropriate, statistical genetic 
approaches for analyzing exome sequence data, and few 
investigators have had experience analyzing such data 
sets. This was the motivation for the Genetic Analysis 
Workshop 17 (GAW17) "mini-exome" data set. The 
GAW17 data set is a hybrid of simulated and real data. 
Real exome sequence data from the 1000 Genomes Pro- 
ject were used as the basis for simulating a common 
complex disease and related quantitative risk factors. 
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Two different study designs were simulated, unrelated 
individuals and large families, each with the same sam- 
ple size. 

1000 Genomes Project 

The 1000 Genomes Project (http://www.l000genomes. 
org) is designed to survey genetic variation at the 
sequence level across multiple human population 
groups. It includes individuals of European, East Asian, 
South Asian, West African, and American Indian ances- 
try. Three pilot projects for the 1000 Genomes Project 
were completed in 2010: low-fold genome-wide sequen- 
cing of 179 individuals, higher fold sequencing of two 
parent-child trios, and exonic sequencing in 697 indivi- 
duals [1]. Publicly available exon sequence data from the 
1000 Genomes Project were used to provide a realistic 
pattern of number and frequency of single-nucleotide 
polymorphisms (SNPs), including cross-population var- 
iation and linkage disequilibrium between sites, for the 
GAW17 simulations. 
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Methods 

Genotype calling 

SNP genotypes were obtained from the sequence align- 
ment files provided by the 1000 Genomes Project for 
their pilot3 study. When the GAW17 data set was gen- 
erated, the 1000 Genomes Project had not yet posted 
processed calls of these genotypes for each individual. 
Thus the UnifiedGenotyper method from the Genome 
Analysis Toolkit (GATK) package [2] was used for the 
detection of SNPs and for the calling of SNP genotypes. 
A male human genome based on National Center for 
Biotechnology Information reference sequence 36 
(RefSeq36) human genome release (human_b36_male. 
fasta.gz) was used as the reference genome sequence for 
both male and female alignments. 

The UnifiedGenotyper method was run twice on the 
alignment files. The first time it was allowed to scan 
freely through the alignments to search for variation 
against the reference sequence to be considered as SNP 
candidates. Genotypes that were not homozygous for 
the reference base allele were called for the candidate 
SNPs detected. Because of time and technical con- 
straints, GAW17 SNPs were chosen to be the subset of 
candidate SNP genotypes that were called from an align- 
ment of 10 or more sequencing reads. During the sec- 
ond run, genotypes, including those homozygous for the 
reference base, were called only for the subset of SNPs 
selected in the first run. 

This procedure had the advantages of being fast, cor- 
rectly calling most of the true common SNP variants, gen- 
erating a large volume of rare SNP variants, and 
producing a genotype matrix with few missing genotypes 
to simplify downstream preparation of the simulated data 
set. However, it was not meant to detect the true natural 
variation present in the 1000 Genomes Project. Thus there 
were more rare SNPs in the GAW17 data set than those 
described in the 1000 Genomes Project Consortium ana- 
lyses of their own pilot data sets [1]. The enrichment of 
rare variants in the GAW17 data set was caused in part by 
artifacts introduced by, for example, lack of filtering. 

The 1000 Genomes Project genotypes were not 
phased, and some genotypes were missing as a result of 
incomplete sequence coverage in some individuals. We 
used the program fastPHASE [http://depts.washington. 
edu/uwc4c/express-licenses/assets/fastphase/] to infer 
missing genotypes and haplotypic phase. In the family 
data set (described later), we used the program CHRSIM 
[3] to drop the phased founder haplotypes throughout 
the rest of the pedigree. Recombination was taken into 
account, with a single obligate crossover event occurring 
on each chromosome. 

As noted, the GAW17 genotypes differ from the offi- 
cial 1000 Genomes Project called genotypes for the 



same individuals because of differing approaches to gen- 
otype calling, inclusion or exclusion of regions of low- 
fold coverage, and the inclusion of the imputed geno- 
types in the GAW17 data set. Imputed genotypes were 
not identified as such in the distributed data and were 
treated as equally "real" as called genotypes in the phe- 
notype simulations. These choices were motivated by a 
focus on designing a data set that would be useful for 
developing methods related to gene localization, identifi- 
cation, and characterization, with the 1000 Genomes 
Project data primarily serving as a source of sequence 
data with realistic patterns of SNP distribution, allele 
frequency, population variation, and linkage disequili- 
brium. However, these decisions limited the utility of 
the GAW17 data for population genetic analyses or for 
examination of effects of genotype calling or data clean- 
ing on gene finding. 

Distributed genotype data 

The called genotype data distributed for GAW17 
included the inferred genotypes, such that all individuals 
had genotypes for all base-pair positions, and phenotypes 
were simulated on the basis of these data. Markers were 
numbered sequentially on each chromosome and were 
labeled C#S# (e.g., C1S254 is the 254th SNP on chromo- 
some 1); marker locations were recorded as RefSeq36 
base-pair coordinates. The 24,487 autosomal SNPs 
detected in genotype calling were, for purposes of the 
simulation, assigned to 3,205 genes based on the first 
intersection found of the marker location and the base- 
pair coordinates of all genes obtained from RefSeq36 
annotations. SNPs that overlapped multiple genes were 
assigned to only one of those genes. There were 1 to 231 
SNPs per gene (mean = 7.64, SD = 14.00). Of the SNPs, 
9,433 (38.4%) were private variants, occurring once in the 
set of 697 unrelated individuals. Multiple private variants 
carried by the same individual resulted in SNPs with 
identical genotypes, including a SNP in the KDR gene 
that was designated as functional in the phenotype simu- 
lations which had identical genotypes to multiple non- 
functional SNPs. Relatively few of the variants were 
common; 74% had minor allele frequency (MAF) < 0.01 
and only 12.8% had MAF > 0.05 (Figure 1). The median 
MAF was 0.002, that is, three copies of the minor allele 
in the sample of 697 unrelated individuals. 

Unrelated individuals and pedigree samples 

Two disparate sampling designs were used in the con- 
struction of the simulated data. One sample consisted of 
697 unrelated individuals, each of whom corresponded 
to an individual from the 1000 Genomes Project data. 
The 1000 Genomes Project subjects whose data were 
used came from the CEPH (European-descent residents 
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Figure 1 Minor allele frequency in the unrelated individuals 
sample for the 15,054 SNPs present in multiple copies Note 
that the scale of the MAF categories is uneven, going by 0.5% 
intervals for MAF < 0.01, by 1% intervals for MAF = 0.01-0.05, and 
by 5% intervals thereafter. 



of Utah; n = 90), Denver Chinese (n = 107), Han Chi- 
nese (n = 109), Japanese (n = 72), Luhya (n = 108), Tus- 
can (n = 66), and Yoruban [n = 112) population groups. 
The second sample configuration used in GAW17 simu- 
lations consisted of 697 individuals in 8 extended 
families with the genotypes for the 202 pedigree foun- 
ders being taken from the 1000 Genomes Project data. 
The founders of the pedigrees were chosen at random 
from the unrelated individuals sample and included 12 
CEPH, 18 Denver Chinese, 19 Han Chinese, 28 Japa- 
nese, 50 Luhya, 66 Tuscan, and 9 Yoruban samples. 
Because of a computational error, the genotypes of pedi- 
gree founders were merged incorrectly across files, 
resulting in an incomplete match between the genotypes 
of the pedigree founders and the corresponding indivi- 
dual from the unrelated individuals sample. This 
affected a small proportion of genotypes (7%) but 
impacted all pedigree founders. Approximately one-third 
of the SNPs were unaffected, and for the two-thirds that 
had substitutions, most had only one or two founders 
with altered genotypes. Pedigree configurations were 
adapted from the pedigrees used for simulated data in 
Genetic Analysis Workshops 10 and 12 [4,5] and 
included four generations and relatives as distant as sec- 
ond cousins. The data set was designed such that all 
family members had genotype and phenotype data avail- 
able with no missing or unexamined relatives. 

Because the pedigree founders were a subset of the 
unrelated individuals, genetic diversity was restricted in 
the families compared to the unrelated individuals sam- 
ple. Of the 24,487 variant sites identified in the unre- 
lated individuals sample, 10,703 were monomorphic in 
the family sample with only one allele appearing. On the 



other hand, some variants that were present in single 
copies or at low frequency in the unrelated individuals 
sample appeared many times in the family sample, 
because they were transmitted by a founder with 
numerous descendants. For example, C6S2981, which 
was designated as functional in the phenotype simula- 
tions, was present in 3 copies in the unrelated indivi- 
duals sample and in 46 copies in the family sample. 
C4S4935, also designated as functional in the simula- 
tions, was present in a single copy in the unrelated indi- 
viduals sample but in 31 copies in the pedigree sample. 

There are 327 males and 370 females in the unrelated 
individuals data set, which preserved the listed sex for 
each of the 1000 Genomes Project samples. The family 
set included 346 males and 351 females. Pedigree foun- 
ders were allowed to have a different sex from the unre- 
lated individuals whose genotypes they shared. However, 
only autosomal markers were used in the GAW17 simu- 
lations (i.e., X and Y data were not included). Assigned 
ages were matched across the family and unrelated indi- 
viduals data sets and ranged from 16 to 91 years, with a 
mean of 41.8 years. 

For the family data set, fully informative markers were 
generated at each gene (recombination was not allowed 
within genes) and used to compute identical-by-descent 
(IBD) allele sharing at each gene location under the 
rationale that family-based data sets were likely to have 
previous short tandem repeat (STR) or high-density 
SNP genotyping that could be used to estimate the IBD 
allele sharing. These IBD matrices were provided as part 
of the GAW17 data set. 

Phenotype model 

A common disease, with a prevalence of 30%, was simu- 
lated along with three related quantitative risk factors, 
Ql, Q2, and Q4. Smoking status (prevalence 25%) was 
also simulated. Phenotype simulations were performed 
multiple times to generate 200 replicates of the unre- 
lated individuals and pedigree data sets. Note that the 
genotype data remained constant across replicates, as 
did age, sex, and pedigree configuration. 

Knowledge about biological pathways and statistical 
predictions regarding the potential deleteriousness of 
coding variants was used in designing the simulation 
model. Pathways for gene enrichment were selected from 
the publicly available Kyoto Encyclopedia of Genes and 
Genomes (KEGG) database (http://www.genome.jp/kegg/) 
and the proprietary software Ingenuity Pathways Analysis 
(IP A), version 8.7 (http://www.ingenuity.com). The vascu- 
lar endothelial growth factor (VEGF) pathway was 
observed to have numerous genes with available typed 
SNPs and was therefore selected as the source of a subset 
of the functional loci for phenotype simulations. The IPA 
version of the VEGF signaling pathway was used as the 
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core source because it included most of the genes in the 
KEGG VEGF signaling pathway as well as some addi- 
tional upstream information, primarily relating to VEGF 
transcriptional control. The overlap between the two data 
sources was considered significant enough not to impede 
the investigations of any researchers limited to the freely 
available KEGG data set. 

Genes influencing Ql come primarily from the VEGF 
pathway; those influencing Q2 were chosen without 
reference to pathways and were primarily related to car- 
diovascular disease risk and inflammation, and those 
influencing latent disease liability also came primarily 
from the VEGF pathway (a different section from the 
one in which genes were selected for Ql). Effect sizes 
for coding variants within genes were assigned using 
PolyPhen and SIFT predictions of the likelihood that the 
variant would be deleterious. The functional variants 



included both rare and common alleles and a range of 
effect sizes, with most having small effects but a few 
having large effects that should be reliably detectable in 
most replicates of the data set. Some genes contained a 
single functional variant and others contained many. 
Population origin of the 1000 Genomes Project partici- 
pants was not used in the phenotype simulations. In 
general, there was little disequilibrium between the 
functional variants (Figures 2, 3, 4), with a few excep- 
tions that were primarily private variants carried in a 
single copy by the same individual (e.g., C3S4836 and 
C10S3092 for Q2 and C4S1877 and C4S1889 for Ql). 

Quantitative risk factors Ql, Q2, and Q4 were simu- 
lated as normally distributed phenotypes. Disease was 
simulated using a liability threshold model, and the top 
30% of the distribution was declared affected. All SNP 
effects were additive on the quantitative trait or liability 
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Figure 3 Gametic disequilibrium (r 2 ) between functional variants for Q2. Markers are shown in chromosomal order from bottom to top 
and from left to right and are symmetric across the diagonal. 



scale, with each copy of the minor allele increasing the 
mean trait value by an equal amount. Genotype by 
environment effects were simulated for Ql. Because 
genotype, age, and sex were held constant across repli- 
cates, the variation in phenotype across replicates came 
primarily from the residual polygenic and residual envir- 
onmental components. The residual polygenic compo- 
nents were correlated between relatives, by definition, 
and also correlated between Ql, Q2, and latent liability. 
The residual environmental components were unique to 
each individual and were simulated to be weakly corre- 
lated between Ql, Q2, and latent liability. 

Ql 

Quantitative risk factor Ql was influenced by 39 SNPs 
in 9 genes (see Table 1). There were 1-11 functional 



variants per gene. Their MAFs in the 1000 Genomes 
Project data ranged from 0.07% (i.e., a single copy of the 
minor allele) to 16.5%. In all cases, the minor allele was 
associated with higher mean Ql; the /J column in the 
table provides the displacement in mean levels of Ql for 
each copy of the minor allele. Ql also had a residual 
heritability of 0.44, resulting from variants at loci not 
included in the current sequence data set. The residual 
genetic component of Ql was correlated with the resi- 
dual genetic components of Q2 and latent liability. 
There were also weaker environmental correlations 
between Ql and Q2 and latent liability. Values of Ql 
were higher in smokers, and there was genotype by 
smoking interaction for the effects of variants in the 
KDR gene on Ql. Effects of the KDR variants were 50% 
higher in smokers than in nonsmokers. (Note that for 
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Figure 4 Gametic disequilibrium (r ) between functional variants for latent liability Markers are shown in chromosomal order from 
bottom to top and from left to right and are symmetric across the diagonal. 



KDR the effect sizes given in Table 1 are those for non- 
smokers.) Ql also increased with age. 

Q2 

Q2 was influenced by 72 SNPs in 13 genes (see Table 2). 
There were 1-13 functional variants per gene. MAFs ran- 
ged from 0.07% to 17.07%. In all cases, the minor allele 
was associated with higher mean Q2. Q2 had a residual 
heritability of 0.29. The residual genetic component of 
Q2 was correlated with the residual genetic components 
of Ql and latent liability. There were also weaker envir- 
onmental correlations between Q2 and Ql and latent lia- 
bility. Q2 was not influenced by age, sex, or smoking 
status. 



Q4 

Q4 had a heritability of 0.70, but none of this genetic 
component was due to genes in this sequencing set (i.e., 
it was not influenced by any of the genotyped exonic 
SNPs). Q4 was lower in smokers, decreased with age, and 
was lower in females. Q4 was protective; that is, indivi- 
duals with higher levels of Q4 had lower risk of disease. 

Affection status 

A normally distributed latent liability trait (not included 
in the distributed phenotype data) was simulated; it was 
influenced by 51 SNPs in 15 genes with 1-24 functional 
variants per gene (see Table 3). MAFs of these variants 
ranged from 0.07% to 25.8%. In all cases, the minor 
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Table 1 Effects on Q1 


Gene 


SNP 


MAF 


P 


ARNT 


C1S6533 


0.01 1478 


0.589734 


ARNT 


C1S6537 


0.00071 7 


0.642689 


ARNT 


C1 S6540 


0 00143S 


0 323662 


ARNT 


C1S6542 


00021 S2 

\J t \J\J 4- 1 -J Z- 


0488219 


ARNT 


C1S6561 


0.00071 7 


0.625721 


ELAVL4 


CI S3 1 8 1 


0.00071 7 


0.795093 


ELAVL4 


C1S3182 


0.00071 7 


0.328748 


FLT1 


CI 3S320 


0.001435 


0.18047 


FLT1 


CI 3S399 

V— 1 -J -J -J J J 


0 000717 


0.457361 


FLT1 


C13S431 


0.017217 


0.732566 


TIT] 


C13S479 


0.00071 7 


0.839669 


TIT] 


C13S505 


0 000717 


0.38582 


TIT] 


CI 3S514 


000071 7 


0 549816 

UJ^2U 1 \J 


TIT] 


C13S522 


0.027977 


0.623466 


TIT] 


C13S523 


0066714 


0.653351 


TIT] 


C13S524 


0.004304 


0.596704 


TIT] 


C13S547 


0.000717 


0.549214 


TIT] 


CI 3S567 


0 00071 7 


0 0905862 


TIT 4 


C5S51 33 


0001435 


0.120761 


TIT 4 


C5S5156 


0.00071 7 


0.385374 


HIT] A 


C14S1718 


0.00071 7 


0.251622 


HIT1 A 


C14S1 729 


0.0021 52 


0.329088 


HIT1 A 


C14S1 734 


0.012195 


0.220448 


HIT] A 


C14S1 736 


0 000717 


0.228202 


HIT3A 


C19S4799 


0 000717 


0 1 74668 


HIT3A 


C19S4815 


0.000717 


0.51468 


HIT3A 


C19S4831 


0.000717 


0.265181 


KDR 


C4S1861 


0.0021 52 


0.598271 


KDR 


C4S1873 


0.00071 7 


0.715613 


KDR 


C4S1 874 


0 00071 7 


0 503075 


KDR 


C4S1877 


0.000717 


1.17194 


KDR 


C4S1878 


0.164993 


0.149975 


KDR 


C4S1879 


0.000717 


0.610938 


KDR 


C4S1884 


0.020803 


0.318125 


KDR 


C4S1887 


0.000717 


0.312058 


KDR 


C4S1889 


0.000717 


1.17194 


KDR 


C4S1890 


0.002152 


0.417977 


VTGTA 


C6S2981 


0.002152 


1.13045 


VTGTC 


C4S4935 


0.000717 


1.40529 



allele was associated with higher mean liability. This 
latent liability trait was also higher in smokers and 
increased with age. Disease risk was a function of this 
latent liability, Ql, Q2, and Q4: 
Liability to disease = latent liability + Ql + Q2 - Q4. (1) 
Using this formula, a quantitative liability score was 
calculated for each individual, and the top 30% of the 
distribution in each simulation replicate was declared 
affected. A consequence of this assignment was that 
each replicate had the same number of affected 



Table 2 Effects on Q2 


Gene 


SNP 


MAF 


P 


BCHE 


T3S4834 


0 000717 


0 232562 


BCHE 


C3S4836 


0.000717 


0.352589 


BCHE 


C3S4856 


0 000717 


0.31 1344 


BCHE 


C3S4859 

LJJ^UJ2 


0 002152 


0.557489 


BCHE 


C3S4860 


0.000717 
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Table 2 Effects on Q2 (Continued) 
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individuals, although the identity of these individuals 
varied across replicates. The effect sizes in Table 3 are 
for liability to disease. 
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